INTERSPEECH.2014 - Speech Synthesis

Total: 51

#1 Using conditional random fields to predict focus word pair in spontaneous spoken English [PDF] [Copy] [Kimi1]

Authors: Xiao Zang ; Zhiyong Wu ; Helen Meng ; Jia Jia ; Lianhong Cai

This paper addresses the problem of automatically labeling focus word pairs in spontaneous spoken English, where a focus word pair refers to a salient part of text or speech together with the word motivating it. The prediction of focus word pairs is important for speech applications such as expressive text-to-speech (TTS) synthesis and speech recognition, and can also aid textual and intention understanding in spoken dialog systems. Traditional approaches such as support vector machine (SVM) prediction neglect the dependency between words and struggle with the imbalanced distribution of positive and negative samples in the dataset. This paper introduces conditional random fields (CRFs) to the task of automatically predicting focus word pairs from lexical, syntactic and semantic features. Furthermore, several new features related to syntactic and semantic information are proposed to achieve better performance. Experiments on the publicly available Switchboard corpus demonstrate that the CRF model outperforms the baseline and the SVM model for focus word pair prediction, and that the newly proposed features further improve the performance of the CRF-based predictor. Specifically, compared to the low recall rate of 11.31% achieved by the SVM model, the proposed CRF-based predictor yields a high recall rate of 70.88% with little impact on precision.
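
As a rough illustration of the modeling choice described above (sequence labeling with a CRF rather than independent per-word SVM decisions), the sketch below uses sklearn-crfsuite on toy data; the features, tags and sentences are invented and simplify focus word pairs to single focus tags.

```python
# Minimal sketch of CRF-based focus labeling with sklearn-crfsuite.
# The feature set, tags and sentences below are illustrative only, not the paper's.
import sklearn_crfsuite

def word_features(sent, i):
    """Lexical/syntactic features for the i-th word of a tokenized sentence."""
    word, pos = sent[i]
    feats = {
        "word.lower": word.lower(),
        "pos": pos,
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
    }
    if i > 0:                      # context of the previous word
        feats["prev.pos"] = sent[i - 1][1]
    if i < len(sent) - 1:          # context of the next word
        feats["next.pos"] = sent[i + 1][1]
    return feats

# Toy training data: one sentence with (word, POS) pairs and focus tags.
train_sents = [[("i", "PRP"), ("said", "VBD"), ("BLUE", "JJ"), ("not", "RB"), ("GREEN", "JJ")]]
train_tags = [["O", "O", "FOCUS", "O", "FOCUS"]]

X = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, train_tags)
print(crf.predict(X))  # sequence-level prediction, unlike per-word SVM decisions
```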

#2 Applications of maximum entropy rankers to problems in spoken language processing [PDF] [Copy] [Kimi1]

Authors: Richard Sproat ; Keith Hall

We report on two applications of Maximum Entropy-based ranking models to problems relevant to automatic speech recognition and text-to-speech synthesis. The first is stress prediction in Russian, a language with notoriously complex morphology and stress rules. The second is the classification of alphabetic non-standard words, which may be read as words (e.g. NATO), as letter sequences (e.g. USA), or as a mixture of the two (e.g. mymsn). For this second task we report results on English and five other European languages.
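
The stress-prediction task lends itself to a ranking formulation: score every candidate stress position in a word and pick the best. The sketch below is a hedged, much-simplified stand-in that uses scikit-learn's logistic regression as the maximum-entropy model; the syllabified words, features, and stress indices are invented.

```python
# Toy maximum-entropy ranker for stress prediction: each syllable of a word is
# a candidate; a logistic-regression (MaxEnt) model scores candidates and the
# highest-scoring one is chosen. Data and features are invented.
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def candidate_features(word, i):
    syllables = word.split("-")            # assume pre-syllabified input
    return {
        "syllable": syllables[i],
        "position_from_start": i,
        "position_from_end": len(syllables) - 1 - i,
        "suffix": word.replace("-", "")[-3:],
    }

# Toy data: hyphen-separated syllables, index of the stressed syllable.
words = [("go-lo-va", 2), ("ko-ro-va", 1)]
feats, labels = [], []
for w, stress in words:
    for i in range(len(w.split("-"))):
        feats.append(candidate_features(w, i))
        labels.append(1 if i == stress else 0)

vec = DictVectorizer()
X = vec.fit_transform(feats)
model = LogisticRegression(max_iter=1000).fit(X, labels)

def predict_stress(word):
    cands = [candidate_features(word, i) for i in range(len(word.split("-")))]
    scores = model.predict_proba(vec.transform(cands))[:, 1]
    return int(np.argmax(scores))          # rank candidates, pick the best

print(predict_stress("go-lo-va"))
```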

#3 Text-to-speech with cross-lingual neural network-based grapheme-to-phoneme models [PDF] [Copy] [Kimi1]

Authors: Xavi Gonzalvo ; Monika Podsiadło

Modern Text-To-Speech (TTS) systems need to increasingly deal with multilingual input. Navigation, social and news are all domains with a large proportion of foreign words. However, when typical monolingual TTS voices are used, the synthesis quality on such input is markedly lower. This is because traditional TTS derives pronunciations from a lexicon or a Grapheme-To-Phoneme (G2P) model which was built using a pre-defined sound inventory and a phonotactic grammar for one language only. G2P models perform poorly on foreign words, while manual lexicon development is labour-intensive, expensive and requires extra storage. Furthermore, large phoneme inventories and phonotactic grammars contribute to data sparsity in unit selection systems. We present an automatic system for deriving pronunciations for foreign words that utilises the monolingual voice design and can rapidly scale to many languages. The proposed system, based on a neural network cross-lingual G2P model, does not increase the size of the voice database, does not require large data annotation efforts, is designed not to increase data sparsity in the voice, and can be sized to suit embedded applications.
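
A heavily simplified, hypothetical sketch of neural-network G2P in the classification style (a sliding window of letters predicting one phoneme per letter) is given below; the paper's actual model is cross-lingual and maps into a shared sound inventory, which this toy monolingual example does not attempt.

```python
# NETtalk-style toy G2P: a window of letters around each position predicts one
# phoneme ("_" could mark silent letters). Training pairs are invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier

PAD = "#"

def letter_windows(word, width=2):
    padded = PAD * width + word + PAD * width
    for i in range(width, width + len(word)):
        yield {f"c{j - width}": padded[i + j - width] for j in range(2 * width + 1)}

# Toy aligned data: each letter maps to exactly one phoneme.
pairs = [("cat", ["k", "ae", "t"]), ("cab", ["k", "ae", "b"])]
X_dicts, y = [], []
for word, phones in pairs:
    X_dicts.extend(letter_windows(word))
    y.extend(phones)

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)
g2p = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0).fit(X, y)
print(g2p.predict(vec.transform(list(letter_windows("bat")))))
```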

#4 Transform mapping using shared decision tree context clustering for HMM-based cross-lingual speech synthesis [PDF] [Copy] [Kimi1]

Authors: Daiki Nagahama ; Takashi Nose ; Tomoki Koriyama ; Takao Kobayashi

This paper proposes a novel transform mapping technique based on shared decision tree context clustering (STC) for HMM-based cross-lingual speech synthesis. In the conventional cross-lingual speaker adaptation based on state mapping, the adaptation performance is not always satisfactory when there are mismatches of languages and speakers between the average voice models of input and output languages. In the proposed technique, we alleviate the effect of the mismatches on the transform mapping by introducing a language-independent decision tree constructed by STC, and represent the average voice models using language-independent and dependent tree structures. We also use a bilingual speech corpus for keeping speaker characteristics between the average voice models of different languages. The experimental results show that the proposed technique decreases both spectral and prosodic distortions between original and generated parameter trajectories and significantly improves the naturalness of synthetic speech while keeping the speaker similarity compared to the state mapping.

#5 Cross-lingual voice conversion-based polyglot speech synthesizer for Indian languages [PDF] [Copy] [Kimi1]

Authors: B. Ramani ; M. P. Actlin Jeeva ; P. Vijayalakshmi ; T. Nagarajan

A polyglot speech synthesizer synthesizes speech for any given monolingual or multilingual text in a single speaker's voice, which requires a polyglot speech corpus. Since it is difficult to find a speaker proficient in multiple languages, the current work exploits the acoustic similarity of phonemes across Indian languages to obtain a polyglot speech corpus for four Indian languages and Indian English using GMM-based cross-lingual voice conversion. The optimum target speaker and GMM topology are chosen based on the performance of a speaker identification system. It is observed that the language that shares the largest number of phonemes with the other languages serves as the best target. A polyglot speech corpus derived in this target speaker's voice is then used to develop an HMM-based polyglot speech synthesizer. The performance of this synthesizer is evaluated in terms of speaker identity using an ABX listening test, quality using mean opinion score (MOS), and speaker switching using a subjective listening test.

#6 An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis [PDF] [Copy] [Kimi1]

Authors: Qiong Hu ; Yannis Stylianou ; Ranniery Maia ; Korin Richmond ; Junichi Yamagishi ; Javier Latorre

This paper applies a dynamic sinusoidal synthesis model to statistical parametric speech synthesis (HTS). For this, we utilise regularised cepstral coefficients to represent both the static amplitude and dynamic slope of selected sinusoids for statistical modelling. During synthesis, a dynamic sinusoidal model is used to reconstruct speech. A preference test is conducted to compare the selection of different sinusoids for cepstral representation. Our results show that when integrated with HTS, a relatively small number of sinusoids selected according to a perceptual criterion can produce quality comparable to using all harmonics. A Mean Opinion Score (MOS) test shows that our proposed statistical system is preferred to one using mel-cepstra from pitch synchronous spectral analysis.

#7 Chaotic mixed excitation source for speech synthesis [PDF] [Copy] [Kimi1]

Authors: Hemant A. Patil ; Tanvina B. Patel

Linear Prediction (LP) analysis has proven to be a powerful and widely used method in speech analysis and synthesis. Synthesis in the LP-based approach is carried out by exciting an all-pole model (whose parameters are derived by LP analysis) with a mixed excitation source consisting of a sequence of impulses for voiced regions and a white-noise source for unvoiced regions. In this paper, we present a novel chaotic excitation source based on the chaotic titration method: the chaos in the voiced and unvoiced regions of speech is quantified by adding noise of known standard deviation, determined using the chaotic titration method. It is observed that, on average for synthesized voices (both male and female), MOS increases from 2 to 2.4, DMOS from 2.1 to 2.4, and preference increases from 39% to 61% in an A/B test. The PESQ score increases from 1 to 1.5 and the MCD score decreases from 4.06 to 4.03 for voices synthesized with the proposed chaotic mixed excitation source. The relatively better performance of the proposed approach may be due to the novel chaotic mixed source of excitation.
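
For context, here is a minimal sketch of the conventional LP synthesis with a mixed impulse/noise excitation that the chaotic source replaces; librosa and SciPy are assumed, and the frame, pitch and order values are illustrative.

```python
# Baseline LP synthesis with a conventional mixed excitation source
# (impulse train for voiced frames, white noise for unvoiced frames).
import numpy as np
import librosa
from scipy.signal import lfilter

def synthesize_frame(frame, order=16, voiced=True, f0=120.0, sr=16000):
    a = librosa.lpc(frame, order=order)            # all-pole model 1/A(z)
    if voiced:
        excitation = np.zeros(len(frame))
        excitation[::int(sr / f0)] = 1.0           # impulse train at the pitch period
    else:
        excitation = np.random.randn(len(frame))   # white-noise source
    out = lfilter([1.0], a, excitation)            # filter excitation through 1/A(z)
    gain = np.sqrt(np.mean(frame ** 2)) + 1e-9     # match the frame energy
    return gain * out / (np.sqrt(np.mean(out ** 2)) + 1e-9)

sr = 16000
t = np.arange(0, 0.032, 1 / sr)                    # one 32 ms analysis frame
frame = 0.5 * np.sin(2 * np.pi * 120 * t) + 0.05 * np.random.randn(len(t))
print(synthesize_frame(frame, voiced=True).shape)
```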

#8 Refined inter-segment joining in multi-form speech synthesis [PDF] [Copy] [Kimi1]

Authors: Alexander Sorin ; Slava Shechtman ; Vincent Pollet

In multi-form speech synthesis, speech output is constructed by splicing waveform segments and parametric speech segments which are generated from statistical models. The decision whether to use the waveform or the statistical parametric form is made per segment. This approach faces certain challenges in the context of inter-segment joining. In this work, we present a novel method whereby all non-contiguous joints are represented by statistically generated speech frames without compromising on naturalness. Speech frames surrounding non-contiguous joints between the waveform segments are re-generated from the models and optimized for concatenation. In addition, a novel pitch smoothing algorithm that preserves the original intonation trajectory while maintaining smoothness is applied. We implemented the spectrum and the pitch smoothing algorithms within a multi-form speech synthesis framework that employs a uniform parametric representation for the natural and statistically modeled speech segments. This framework facilitates pitch modification in natural segments. Subjective evaluation results reveal that the proposed smoothing methods significantly improve the perceived speech quality.

#9 A hierarchical Viterbi algorithm for Mandarin hybrid speech synthesis system [PDF] [Copy] [Kimi2]

Authors: Ran Zhang ; Zhengqi Wen ; Jianhua Tao ; Ya Li ; Bing Liu ; Xiaoyan Lou

The hybrid speech synthesis system, which combines the hidden Markov model and unit selection method, has become another mainstream approach in state-of-the-art TTS systems. However, the traditional Viterbi algorithm is based on global minimization of a cost function, and the procedure can end up selecting some poor-quality units with large local errors, which listeners can hardly tolerate. In Mandarin and many other languages, the naturalness of regions of consecutive voiced speech segments (CVS) is especially important to the overall quality of the synthetic speech. Consequently, in this paper, we propose a hierarchical Viterbi algorithm which involves two rounds of Viterbi search: one for the sub-paths within the CVS regions, and the other for the utterance path connecting all the sub-paths. In the proposed technique, a CVS region is defined as a region formed by two or more voiced phones whose pitch observations form a continuous contour. Subjective evaluations suggest that the hierarchical Viterbi algorithm outperforms the traditional algorithm in the Mandarin hybrid speech synthesis system in both the naturalness and the speech quality of the synthetic speech.
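
The sub-path and utterance-level searches both reduce to the standard unit-selection Viterbi recursion; a minimal single-pass sketch (toy costs, no CVS-region split) is given below.

```python
# Basic unit-selection Viterbi search: pick one candidate unit per target
# position minimizing the sum of target costs and concatenation costs.
# The hierarchical variant in the paper runs such a search first inside each
# CVS region, then over the resulting sub-paths; all costs here are toy values.
import numpy as np

def viterbi_unit_selection(target_costs, concat_cost):
    """target_costs: list over positions, each an array of per-candidate costs.
    concat_cost(i, u, v): cost of joining candidate u at position i-1 with v at i."""
    n = len(target_costs)
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for i in range(1, n):
        tc = np.asarray(target_costs[i], dtype=float)
        cur = np.empty_like(tc)
        ptr = np.empty(len(tc), dtype=int)
        for v in range(len(tc)):
            joins = [best[-1][u] + concat_cost(i, u, v) for u in range(len(best[-1]))]
            ptr[v] = int(np.argmin(joins))
            cur[v] = tc[v] + min(joins)
        best.append(cur)
        back.append(ptr)
    # Trace back the globally cheapest path.
    path = [int(np.argmin(best[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

# Toy example: 3 target positions with 2-3 candidate units each.
tc = [[0.2, 0.5], [0.1, 0.4, 0.3], [0.6, 0.2]]
print(viterbi_unit_selection(tc, lambda i, u, v: 0.0 if u == v else 0.25))
```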

#10 Automatic animation of an articulatory tongue model from ultrasound images using Gaussian mixture regression [PDF] [Copy] [Kimi2]

Authors: Diandra Fabre ; Thomas Hueber ; Pierre Badin

This paper presents a method for automatically animating the articulatory tongue model of a reference speaker from ultrasound images of the tongue of another speaker. This work is developed in the context of speech therapy based on visual biofeedback, where a speaker is provided with visual information about his/her own articulation. In our approach, the feedback is delivered via an articulatory talking head, which displays the tongue during speech production using augmented reality (e.g. transparent skin). The user's tongue movements are captured using ultrasound imaging and parameterized using the PCA-based EigenTongue technique. Extracted features are then converted into control parameters of the articulatory tongue model using Gaussian Mixture Regression. This procedure was evaluated by decoding the converted tongue movements at the phonetic level using an HMM-based decoder trained on the reference speaker's articulatory data. Decoding errors were then manually reassessed in order to take into account possible phonetic idiosyncrasies (i.e. speaker / phoneme specific articulatory strategies). With a system trained on a limited set of 88 VCV sequences, the recognition accuracy at the phonetic level was found to be approximately 70%.
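
Gaussian mixture regression itself can be sketched compactly: fit a joint GMM on stacked input/output vectors and map new inputs through the conditional expectation. The example below uses invented low-dimensional data rather than EigenTongue features or tongue-model parameters.

```python
# Minimal Gaussian mixture regression (GMR) sketch on toy data.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4):
    """Fit one GMM on stacked [input; output] vectors."""
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def gmr_predict(gmm, X, dx):
    """E[y|x] = sum_k p(k|x) * (mu_y_k + S_yx_k S_xx_k^{-1} (x - mu_x_k))."""
    dy = gmm.means_.shape[1] - dx
    preds = np.zeros((len(X), dy))
    for n, x in enumerate(X):
        log_w, cond_means = [], []
        for k in range(gmm.n_components):
            mu, S = gmm.means_[k], gmm.covariances_[k]
            mu_x, mu_y = mu[:dx], mu[dx:]
            S_xx, S_yx = S[:dx, :dx], S[dx:, :dx]
            log_w.append(np.log(gmm.weights_[k]) +
                         multivariate_normal.logpdf(x, mu_x, S_xx))
            cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
        log_w = np.array(log_w)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                               # responsibilities p(k|x)
        preds[n] = w @ np.array(cond_means)
    return preds

# Toy data: 2-D "ultrasound" features -> 1-D "tongue model" parameter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X[:, :1] ** 2 + 0.1 * rng.normal(size=(200, 1))
gmm = fit_joint_gmm(X, Y)
print(gmr_predict(gmm, X[:3], dx=2))
```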

#11 Articulatory controllable speech modification based on statistical feature mapping with Gaussian mixture models [PDF] [Copy] [Kimi1]

Authors: Patrick Lumban Tobing ; Tomoki Toda ; Graham Neubig ; Sakriani Sakti ; Satoshi Nakamura ; Ayu Purwarianti

This paper presents a novel speech modification method capable of controlling unobservable articulatory parameters based on a statistical feature mapping technique with Gaussian Mixture Models (GMMs). In previous work [1], the GMM-based statistical feature mapping was successfully applied to acoustic-to-articulatory inversion mapping and articulatory-to-acoustic production mapping separately. In this paper, these two mapping frameworks are integrated into a unified framework to develop a novel speech modification system. The proposed system sequentially performs the inversion and the production mapping, making it possible to modify phonemic sounds of an input speech signal by intuitively manipulating the articulatory parameters estimated from it. We also propose a manipulation method that automatically compensates for unmodified articulatory movements by considering the inter-dimensional correlation of the articulatory parameters. The proposed system is implemented for a single English speaker and its effectiveness is evaluated experimentally. The experimental results demonstrate that the proposed system is capable of modifying phonemic sounds by manipulating the estimated articulatory movements, and that higher speech quality is achieved by considering the inter-dimensional correlation in the manipulation.

#12 Speech-driven head motion synthesis using neural networks [PDF] [Copy] [Kimi]

Authors: Chuang Ding ; Pengcheng Zhu ; Lei Xie ; Dongmei Jiang ; Zhong-Hua Fu

This paper presents a neural network approach for speech-driven head motion synthesis, which can automatically predict a speaker's head movement from his/her speech. Specifically, we realize speech-to-head-motion mapping by learning a multi-layer perceptron from audio-visual broadcast news data. First, we show that a generatively pre-trained neural network significantly outperforms a randomly initialized network and the hidden Markov model (HMM) approach. Second, we demonstrate that the feature combination of log Mel-scale filter-bank (FBank), energy and fundamental frequency (F0) performs best in head motion prediction. Third, we discover that using long-context acoustic information can further improve the performance. Finally, using extra unlabeled training data in the pre-training stage yields further performance gains. The proposed speech-driven head motion synthesis approach increases the CCA from 0.299 (the HMM approach) to 0.565 and can be effectively used in expressive talking avatar animation.
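
A hedged sketch of the core mapping (context-stacked acoustic frames regressed to head-motion parameters with a multi-layer perceptron) is shown below using scikit-learn; it omits the generative pre-training that the paper shows to be important, and all data and dimensionalities are placeholders.

```python
# Speech-to-head-motion mapping as an MLP regressor on synthetic data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames, n_acoustic, n_motion, context = 1000, 42, 3, 5   # e.g. 40 FBank + energy + F0

acoustic = rng.normal(size=(n_frames, n_acoustic))
motion = rng.normal(size=(n_frames, n_motion))              # e.g. head pitch/yaw/roll

def stack_context(feats, context):
    """Concatenate +/- context frames around each frame (edges clamped)."""
    idx = np.clip(np.arange(-context, context + 1)[None, :] +
                  np.arange(len(feats))[:, None], 0, len(feats) - 1)
    return feats[idx].reshape(len(feats), -1)

X = stack_context(acoustic, context)
mlp = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200,
                   random_state=0).fit(X, motion)
print(mlp.predict(X[:2]).shape)   # -> (2, 3) predicted head-motion frames
```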

#13 Text-independent voice conversion using speaker model alignment method from non-parallel speech [PDF] [Copy] [Kimi1]

Authors: Peng Song ; Yun Jin ; Wenming Zheng ; Li Zhao

In this paper, we propose a novel voice conversion method called speaker model alignment (SMA), which does not require parallel training speech. First, the source and target speaker models, each described by a Gaussian mixture model (GMM), are trained separately. Then, the transformation function for spectral features is learned by iteratively aligning the components of the source and target speaker models. Additionally, the transformation function is combined with the GMM to enable multiple local mappings, and a local consistent GMM (LCGMM) is also considered for model training to improve the conversion accuracy. Finally, we carry out experiments to evaluate the performance of the proposed method. Objective and subjective experimental results demonstrate that, compared with the well-known INCA approach, the proposed method achieves lower spectral distortions and higher correlations, and obtains a significant improvement in perceptual quality and similarity.

#14 Voice conversion using generative trained deep neural networks with multiple frame spectral envelopes [PDF] [Copy] [Kimi1]

Authors: Ling-Hui Chen ; Zhen-Hua Ling ; Li-Rong Dai

This paper presents a deep neural network (DNN) based spectral envelope conversion method. A global DNN is employed to model the complex non-linear mapping relationship between the spectral envelopes of source and target speakers. The proposed DNN is generatively trained layer-by-layer by a cascade of two restricted Boltzmann machines (RBMs) and a bidirectional associative memory (BAM), which are treated as generative models estimated using the contrastive divergence algorithm. Furthermore, multiple spectral envelopes are adopted instead of dynamic features for better modeling with the DNN. The superiority of the proposed method is validated by the subjective experimental results.

#15 Hierarchical modeling of F0 contours for voice conversion [PDF] [Copy] [Kimi1]

Authors: Gerard Sanchez ; Hanna Silen ; Jani Nurminen ; Moncef Gabbouj

Voice conversion systems modify a speech signal so that it sounds as if it were uttered by another speaker. The conversion of the spectral features has attracted a lot of research attention, but the conversion of pitch, which models the speaker-dependent prosody, is often achieved by just controlling the F0 level and range. However, the detailed prosody, involving different linguistic units at several distinct temporal scales, can carry a significant amount of speaker-identity-related information. This paper introduces a new method for converting the prosody, using wavelets to decompose the pitch contour into ten temporal scales ranging from microprosody to the utterance level, which allows modeling the different timings of prosodic phenomena. The prosody conversion is carried out in the wavelet domain, using regression techniques originally developed for the spectral conversion of speech. The performance of the proposed prosody conversion method is evaluated within a real voice conversion system. The results for cross-gender conversion indicate a significant improvement in naturalness when compared to the traditional approach of shifting and scaling the F0 to match the target speaker's mean and variance.
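
The decomposition step can be illustrated with a continuous wavelet transform over dyadic scales, as sketched below with PyWavelets; the contour, frame rate and scale choices are toy assumptions, not the paper's exact setup.

```python
# Decompose an (interpolated, normalized) log-F0 contour into several temporal
# scales with a continuous wavelet transform; each scale can then be converted
# separately and the target contour re-synthesized from the converted scales.
import numpy as np
import pywt

frame_rate = 200                                   # 5 ms frames (assumption)
t = np.arange(0, 2.0, 1.0 / frame_rate)
log_f0 = np.log(120 + 20 * np.sin(2 * np.pi * 0.8 * t) + 3 * np.sin(2 * np.pi * 8 * t))
log_f0 = (log_f0 - log_f0.mean()) / log_f0.std()   # normalize before the CWT

# Dyadic scales: small scales ~ microprosody, large scales ~ phrase/utterance level.
scales = 2.0 ** np.arange(1, 11)
coeffs, freqs = pywt.cwt(log_f0, scales, "mexh", sampling_period=1.0 / frame_rate)
print(coeffs.shape)    # (10 scales, n_frames): one contour per temporal scale
```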

#16 Speech prosody generation for text-to-speech synthesis based on generative model of F0 contours [PDF] [Copy] [Kimi1]

Authors: Kento Kadowaki ; Tatsuma Ishihara ; Nobukatsu Hojo ; Hirokazu Kameoka

This paper deals with the problem of generating the fundamental frequency (F0) contour of speech from a text input for text-to-speech synthesis. We have previously introduced a statistical model describing the generating process of speech F0 contours, based on the discrete-time version of the Fujisaki model. One remarkable feature of this model is that it has allowed us to derive an efficient algorithm based on powerful statistical methods for estimating the Fujisaki-model parameters from raw F0 contours. To associate a sequence of the Fujisaki-model parameters with a text input based on statistical learning, this paper proposes extending this model to a context-dependent one. We further propose a parameter training algorithm for the present model based on a decision tree-based context clustering.
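
For reference, the Fujisaki model underlying this generative F0 model expresses log-F0 as a baseline plus phrase commands (impulse responses) and accent commands (step responses). A small continuous-time sketch follows (the paper uses a discrete-time version); the command timings and amplitudes are invented for illustration.

```python
# Continuous-time Fujisaki model: ln F0(t) = ln Fb + phrase + accent components.
import numpy as np

def fujisaki_f0(t, fb=100.0, phrases=(), accents=(), alpha=3.0, beta=20.0, gamma=0.9):
    def Gp(x):                                   # phrase control mechanism
        return np.where(x >= 0, alpha**2 * x * np.exp(-alpha * x), 0.0)
    def Ga(x):                                   # accent control mechanism
        return np.where(x >= 0, np.minimum(1 - (1 + beta * x) * np.exp(-beta * x), gamma), 0.0)
    log_f0 = np.full_like(t, np.log(fb))
    for Ap, T0 in phrases:                       # (amplitude, onset time)
        log_f0 += Ap * Gp(t - T0)
    for Aa, T1, T2 in accents:                   # (amplitude, onset, offset)
        log_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(log_f0)

t = np.linspace(0, 2.0, 400)
f0 = fujisaki_f0(t, phrases=[(0.5, 0.0)], accents=[(0.4, 0.2, 0.7), (0.3, 1.0, 1.5)])
print(f0.min(), f0.max())
```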

#17 An iterative approach to decision tree training for context dependent speech synthesis [PDF] [Copy] [Kimi1]

Authors: Xiayu Chen ; Yang Zhang ; Mark Hasegawa-Johnson

EDHMM with decision trees is a popular model for parametric speech synthesis. The traditional training procedure constructs the decision trees after the observation probability densities have been optimized with the EM algorithm, under the assumption that the state assignment probabilities do not change much during tree construction. This paper proposes an iterative algorithm that removes this assumption: decision tree construction is incorporated into the EM iteration, with a safeguard procedure that ensures convergence. Evaluation on the Boston University Radio Speech Corpus shows that the proposed algorithm achieves a significantly better optimum on the training set than the original procedure, and that the advantage generalizes well to the test set.

#18 Prosodic phrasing modeling for Vietnamese TTS using syntactic information [PDF] [Copy] [Kimi1]

Authors: Thi Thu Trang Nguyen ; Albert Rilliard ; Do Dat Tran ; Christophe d'Alessandro

This research aims at modeling prosodic phrasing to improve the naturalness of speech synthesis for Vietnamese, a tonal language. The proposed phrasing model includes hypotheses on: (i) prosodic structure based on syntactic rules; (ii) final lengthening linked to syllabic structures and tone types. Audio files in the analysis corpus are manually transcribed at the syllable level, including perceived pauses, and the corresponding texts are parsed and represented as annotated syntax trees. Statistical analysis reveals a correlation between syntactic constituent boundaries and pause duration: major breaks may appear at the end of a clause or between predicates or head elements, while other rules between grammatical phrases/words or shorter clauses may trigger minor breaks. Break levels (including those predicted by the syntactic rules) and the relative positions of syllables are used to train VTed, an HMM-based Text-To-Speech (TTS) system for Vietnamese. In the synthesis phase, break levels are explicitly inserted and lengthening is applied to the last syllables of prosodic phrases. Perceptual testing shows an increase of 0.34 on a 5-point MOS scale for the new prosodically informed system (3.95/5) compared to the previous TTS system (3.61/5). In the pair-wise comparison test, the synthetic voice with the proposed model is preferred to the previous version in about 70% of cases.

#19 Accent type and phrase boundary estimation using acoustic and language models for automatic prosodic labeling [PDF] [Copy] [Kimi1]

Authors: Tomoki Koriyama ; Hiroshi Suzuki ; Takashi Nose ; Takahiro Shinozaki ; Takao Kobayashi

This paper proposes an automatic prosodic labeling technique for constructing speech databases used for speech synthesis. In corpus-based Japanese speech synthesis, it is essential to use speech data annotated with prosodic information such as phrase boundaries and accent types; however, manual annotation is generally time-consuming and expensive. To overcome this problem, we propose a technique for estimating accent types and phrase boundaries from the speech waveform and its transcribed text using both language and acoustic models. We use a conditional random field (CRF) as the language model, and an HMM as the acoustic model, which has been shown to be effective for prosody modeling in speech synthesis. By introducing the HMM, the continuously changing features of F0 contours are modeled well, which results in higher estimation accuracy than conventional techniques that use a simple polygonal-line approximation of F0 contours.

#20 Reconstruction of mistracked articulatory trajectories [PDF] [Copy] [Kimi1]

Authors: Qiang Fang ; Jianguo Wei ; Fang Hu

Kinematic articulatory data are important for research on speech production, articulatory speech synthesis, robust speech recognition, and speech inversion. The electromagnetic articulograph (EMA) is a widely used instrument for collecting kinematic articulatory data. However, in EMA experiments, one or more coils attached to the articulators may be mistracked for various reasons. To make full use of the EMA data, we attempt to reconstruct the locations of mistracked coils with a Gaussian mixture model (GMM) regression method. In this paper, we explore how additional information (spectrum, articulatory velocity, etc.) affects the performance of the proposed method. The results indicate that the acoustic feature (MFCC) is the most effective additional feature for improving the reconstruction performance.

#21 Enabling controllability for continuous expression space [PDF] [Copy] [Kimi1]

Authors: Langzhou Chen ; Norbert Braunschweiler

A continuous expression space assumes that each utterance contains individual expressions. Thus, it can be used to model detailed expression information in speech data. However, since an infinite number of different expressions can be contained in the continuous expression space, it is very difficult to label them manually. This means these expressions are very hard to identify and to extract for synthesising expressive speech; a mechanism to control the continuous expression space is missing. In a discrete expression space, by contrast, only a few emotions are defined, so users can easily choose from these emotions, but the range of expressivity is limited. This work proposes a method to automatically annotate expressions in the continuous expression space based on the cluster adaptive training (CAT) method. Using the proposed method, complex emotion information can be associated with the individual expressions in the continuous space. These emotion labels can be used as indexes of the expressions in the continuous space, enabling users to select desired expressions at synthesis time, i.e. enabling controllability of the continuous expression space. Meanwhile, the rich expressive information in the continuous space is kept, so that more expressive speech can be generated compared to the discrete space.

#22 Analysis of spectral enhancement using global variance in HMM-based speech synthesis [PDF] [Copy] [Kimi1]

Authors: Takashi Nose ; Akinori Ito

This paper analyzes a problem of the spectral enhancement technique using global variance (GV) in HMM-based speech synthesis. In conventional GV-based parameter generation, spectral enhancement with variance compensation is achieved by using a GV pdf with fixed parameters for every output utterance throughout the generation process. Although the spectral peaks of the generated trajectory are clearly emphasized and subjective clarity is improved, the use of fixed GV parameters results in much smaller variation of GVs among the synthesized utterances than in natural speech, which sometimes causes undesirable effects. In this paper, we examine this problem in terms of multiple objective measures, such as variance characteristics, spectral and GV distortions, and GV correlations, and discuss the results. We propose a simple alternative technique based on an affine transformation that provides a GV distribution closer to that of the original speech and improves the correlation of the GVs of generated parameter sequences. The experimental results show that the proposed spectral enhancement outperforms conventional GV-based parameter generation on the objective measures.
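
A minimal sketch of utterance-level variance compensation in this spirit follows: an affine transform around the utterance mean that scales each dimension of a generated trajectory to a target GV (here a toy constant rather than a trained GV pdf).

```python
# GV-style variance compensation as a per-dimension affine transform.
import numpy as np

def gv_postfilter(traj, target_gv):
    """traj: (n_frames, dim) generated parameters; target_gv: (dim,) variances."""
    mean = traj.mean(axis=0)
    gen_gv = traj.var(axis=0) + 1e-12
    scale = np.sqrt(target_gv / gen_gv)          # per-dimension affine slope
    return mean + scale * (traj - mean)          # variance now matches target_gv

rng = np.random.default_rng(0)
traj = rng.normal(scale=0.5, size=(300, 25))     # over-smoothed generated trajectory
enhanced = gv_postfilter(traj, target_gv=np.full(25, 1.0))
print(traj.var(axis=0).mean(), enhanced.var(axis=0).mean())
```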

#23 Intelligibility analysis of fast synthesized speech [PDF] [Copy] [Kimi1]

Authors: Cassia Valentini-Botinhao ; Markus Toman ; Michael Pucher ; Dietmar Schabus ; Junichi Yamagishi

In this paper we analyse the effect of speech corpus and compression method on the intelligibility of synthesized speech at fast rates. We recorded English and German language voice talents at a normal and a fast speaking rate and trained an HSMM-based synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to generated speech. Word recognition results for the English voices show that generating speech at normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, for both blind and sighted participants. These results indicate that using fast speech data does not necessarily create more intelligible voices and that linear compression can more reliably provide higher intelligibility, particularly at higher rates.
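
As a hedged illustration of linear (uniform) compression applied after generation, the sketch below resamples a generated parameter trajectory onto a shorter time axis; the paper's compression operates on the synthesized speech itself, so this is a simplified stand-in.

```python
# Uniform time compression of a generated parameter trajectory by interpolation.
import numpy as np

def linear_compress(traj, rate):
    """traj: (n_frames, dim); rate > 1 speeds speech up by that factor."""
    n_in = len(traj)
    n_out = max(2, int(round(n_in / rate)))
    src = np.linspace(0, n_in - 1, n_out)        # compressed time axis
    out = np.empty((n_out, traj.shape[1]))
    for d in range(traj.shape[1]):               # interpolate each dimension
        out[:, d] = np.interp(src, np.arange(n_in), traj[:, d])
    return out

traj = np.random.default_rng(0).normal(size=(400, 40))   # e.g. 2 s at 5 ms frames
fast = linear_compress(traj, rate=2.0)                    # twice as fast
print(traj.shape, fast.shape)    # (400, 40) (200, 40)
```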

#24 Speech synthesis reactive to dynamic noise environmental conditions [PDF] [Copy] [Kimi1]

Authors: Susana Palmaz López-Peláez ; Robert A. J. Clark

This paper addresses the issue of generating synthetic speech in changing noise conditions. We will investigate the potential improvements that can be introduced by using a speech synthesiser that is able to modulate between a normal speech style and a speech style produced in a noisy environment according to a changing level of noise. We demonstrate that an adaptive system where the speech style is changed to suit the noise conditions maintains intelligibility and improves naturalness compared to traditional systems.

#25 Partial representations improve the prosody of incremental speech synthesis [PDF] [Copy] [Kimi1]

Author: Timo Baumann

When humans speak, they do not plan their full utterance in all detail before beginning to speak, nor do they speak piece-by-piece while ignoring their full message; instead, they use partial representations in which they fill in the missing parts as the utterance unfolds. Incremental speech synthesizers, in contrast, have not yet made use of partial representations and the information contained therein. We analyze the quality of prosodic parameter assignments (pitch and duration) generated from partial utterance specifications (substituting defaults for missing features) in order to determine the requirements that symbolic incremental prosody modelling should meet. We find that broader, higher-level information helps to improve prosody even if lower-level information about the near future is not yet available. Furthermore, we find that symbolic phrase-level or utterance-level information is most helpful towards the end of the phrase or utterance, respectively, that is, when this information is becoming available even in the incremental case. Thus, the negative impact of incremental processing can be minimized by using partial representations that are filled in incrementally.